{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<center><h1> State space models</h1></center>\n",
    "\n",
    "# 1. Basic\n",
    "A **state space model (SSM)** is just like an HMM,except the hidden states are continuous. The model can be written in the following generic form\n",
    "\\begin{align}\n",
    "z_t &=g(u_t,z_{t-1},\\epsilon_t) \\\\\n",
    "y_t &=h(z_t,u_t,\\delta_t)\n",
    "\\end{align}\n",
    "where $z_t$ is the hidden state,$u_t$ is an optional input,$y_t$ is the observation,$g$ is the **transition model**, $h$ is the **observation model**, $\\epsilon_t$ is the system noise at time $t$, and $\\delta_t$ is the observation noise at time $t$.\n",
    "\n",
    "One of the primary goals in using SSMs is to recursively estimate the belief state $p(z_t|y_{1:t},u_{1:t},\\theta)$ (we will often drop the conditioning on $u$ and $\\theta$ for brevity) and then we can convert our beliefs about the hidden state into predictions about future observations by computing the posterior predictive\n",
    "$$\n",
    "p(y_{t+1}|y_{1:t})=\\int\\int p(z_t|y_{1:t})p(z_{t+1}|z_t)p(y_{t+1}|z_{t+1})dz_tdz_{t+1}\n",
    "$$\n",
    "An important special case of an SSM is where all conditional distributions are linear Gaussian. We assume\n",
    "* The transition model is a linear function\n",
    "$$\n",
    "z_t=\\mathbf{A}_tz_{t-1}+\\mathbf{B}_tu_t+\\epsilon_t\n",
    "$$\n",
    "* The observation model is a linear function\n",
    "$$\n",
    "y_t=\\mathbf{C}_tz_t+\\mathbf{D}_tu_t+\\delta_t\n",
    "$$\n",
    "* The system noise is Gaussian\n",
    "$$\n",
    "\\epsilon_t \\sim \\mathcal{N}(0,\\mathbf{Q}_t)\n",
    "$$\n",
    "* The observation noise is Gaussian \n",
    "$$\n",
    "\\delta_t \\sim \\mathcal{N}(0,\\mathbf{R}_t)\n",
    "$$\n",
    "\n",
    "This model is called a **linear-Gaussian SSM (LG-SSM)** or a **linear dynamical system (LDS)**. If the parameter $\\vec{\\theta}_t= \\left(\\mathbf{A}_t,\\mathbf{B}_t,\\mathbf{C}_t,\\mathbf{D}_t,\\mathbf{Q}_t,\\mathbf{R}_t\\right)$ are independent of time, the model is called **stationary**.\n",
    "\n",
    "The **LG-SSM** is important  because it supports exact inference. In particular,if the initial belief state is Gaussian,$p(z_1)=\\mathcal{N}(\\mu_{1|0},\\Sigma_{1|0})$,then all subsequent belief states will also be Gaussian $p(z_t|y_{1:t})=\\mathcal{N}(\\mu_{t|t},\\Sigma_{t|t})$. We can compute these quantities efficiently using the celebrated Kalman filter which will be discussed later.\n",
    "\n",
    "# 2. Application of SSMs\n",
    "## 2.1 SSMs for object tracking\n",
    "One of the earliest applications of Kalman filtering was for tracking objects, such as airplanes and missiles,from noisy measurements, such as radar. Here we give a simplified example to illustrate the key ideas. Consider an object moving in a 2D plane.Let $z_{1t}$ and $z_{2t}$ be the horizontal and vertical locations of the object, and $\\dot{z}_{1t}$ and $\\dot{z}_{2t}$ be the corresponding velocity. we can represent this as a state vector\n",
    "$$\n",
    "\\vec{z}_t=\\left(z_{1t},z_{2t},\\dot{z}_{1t},\\dot{z}_{2t} \\right)\n",
    "$$\n",
    "Let us assume that the object is moving at constant velocity,but is \"perturbed\" by random Gaussian noise. Thus we can model the system dynamics as follows.\n",
    "\\begin{align}\n",
    "\\vec{z}_t      &=\\mathbf{A}_t\\vec{z}_{t-1}+\\vec{\\epsilon}_t \\\\\n",
    "\\begin{bmatrix}\n",
    "z_{1t} \\\\\n",
    "z_{2t} \\\\\n",
    "\\dot{z}_{1t} \\\\\n",
    "\\dot{z}_{2t} \n",
    "\\end{bmatrix}  \n",
    "&=\\begin{bmatrix}\n",
    "1 & 0 & \\Delta t & 0 \\\\\n",
    "0 & 1 & 0 & \\Delta t \\\\\n",
    "0 & 0 & 1 & 0 \\\\\n",
    "0 & 0 & 0 & 1 \n",
    "\\end{bmatrix}\n",
    "\\begin{bmatrix}\n",
    "z_{1,t-1} \\\\\n",
    "z_{2,t-1} \\\\\n",
    "\\dot{z}_{1,t-1} \\\\\n",
    "\\dot{z}_{2,t-2} \n",
    "\\end{bmatrix}\n",
    "+\n",
    "\\begin{bmatrix}\n",
    "\\epsilon_{1t} \\\\\n",
    "\\epsilon_{2t} \\\\\n",
    "\\epsilon_{3t} \\\\\n",
    "\\epsilon_{4t} \n",
    "\\end{bmatrix}\n",
    "\\end{align}\n",
    "where $\\vec{\\epsilon}_t \\sim \\mathcal{N}(0,\\mathbf{Q})$ is the system noise, and $\\Delta$ is the **sampling period**.\n",
    "\n",
    "Now suppose that we can observe the location of the object but not its velocity. Let $\\vec{y}_t \\in R^2$ represent our observation, which we assume \n",
    "\\begin{align}\n",
    "\\vec{y}_t &=\\mathbf{C}_t \\vec{z}_t+\\vec{\\delta}_t \\\\\n",
    "\\begin{bmatrix}\n",
    "y_{1t} \\\\\n",
    "y_{2t}\n",
    "\\end{bmatrix} &=\n",
    "\\begin{bmatrix}\n",
    "1 & 0 & 0 & 0 \\\\\n",
    "0 & 1 & 0 & 0 \n",
    "\\end{bmatrix}\n",
    "\\begin{bmatrix}\n",
    "z_{1t} \\\\\n",
    "z_{2t} \\\\\n",
    "\\dot{z}_{1t} \\\\\n",
    "\\dot{z}_{2t} \n",
    "\\end{bmatrix} \n",
    "+\n",
    "\\begin{bmatrix}\n",
    "\\delta_{1t} \\\\\n",
    "\\delta_{2t}\n",
    "\\end{bmatrix}\n",
    "\\end{align}\n",
    "where $\\vec{\\delta}_t$ is the measurement noise.\n",
    "Finally, we need to specify our initial (prior) beliefs about the state of the object, $p(\\vec{z}_1)$. We will assume this is a Gaussian,$p(\\vec{z}_1)=\\mathcal{N}(\\vec{z}_1|\\vec{\\mu}_{1|0},\\Sigma_{1|0})$. We can represent prior ignorance by making $\\Sigma_{1|0}$ suitably \"board\", e.g., $\\Sigma_{1|0}=\\infty \\mathbf{I}$. We have now fully specified the model and can perform sequential Bayesian updating to compute $p(z_t|y_{1:t})$ using Kalman filter.\n",
    "<img src=\"imgs/1.png\" alt=\"drawing\" width=\"900\"/>\n",
    "## 2.2 Online parameter learning using recursive least squares\n",
    "We can perform online Bayesian inference for the parameters of various statistical models using SSMs. Taking linear regression as an example. The basic idea is to let the hidden state represent the regression parameter, and to let the (time-varying) observation model represent the current data vector. Define the prior to be $p(\\vec{\\theta})=\\mathcal{N}(\\vec{\\theta}|\\vec{\\theta}_0,\\Sigma_0)$. Let the hidden state be $\\vec{z}_t=\\vec{\\theta}$, if we assume the regression parameters do not change, we can set $\\mathbf{A}_t=\\mathbf{I}$ and $\\mathbf{Q}_t=0\\mathbf{I}$, so\n",
    "$$\n",
    "p(\\vec{\\theta}_t|\\vec{\\theta}_{t-1})=\\mathcal{N}(\\vec{\\theta}_t|\\vec{\\theta}_{t-1},0\\mathbf{I})=\\delta_{\\vec{\\theta}_{t-1}}(\\vec{\\theta}_t)\n",
    "$$\n",
    "Let $\\mathbf{C}_t=\\vec{x}_t^T$ and $\\mathbf{R}_t=\\sigma^2$, so the observation model has the form\n",
    "$$\n",
    "\\mathcal{N}(y_t|\\mathbf{C}_t\\vec{z}_t,\\mathbf{R}_t)=\\mathcal{N}(y_t|\\vec{x}_t^T\\vec{\\theta}_t,\\sigma^2)\n",
    "$$\n",
    "Applying the Kalman filter to this model provides a way to update our posterior beliefs about the parameters as the data streams in. This is known as the **recursive least squares (RLS)** algorihtm.\n",
    "\n",
    "# 3. Inference in LG-SSM\n",
    "## 3.1 The Kalman filtering algorithm\n",
    "The **Kalman filtering** is an algorithm for exact Baysian filtering for linear-Gaussian state space models, which is similar to the forward algorithm in HMM. We will represent the marginal posterior at time t by\n",
    "$$\n",
    "p(z_t|y_{1:t},u_{1:t})=\\mathcal{N}(z_t|\\mu_t,\\Sigma_t)\n",
    "$$\n",
    "Since everything is Gaussian, we can perform the prediction and update steps in closed form.\n",
    "\\begin{align}\n",
    "p(z_t|y_{1:t},u_{1:t}) &=p(z_t|y_{1:t-1},y_t,u_{1:t}) \\\\\n",
    "                    &=\\frac{p(y_t|z_t,y_{1:t-1},u_{1:t})p(z_t|y_{1:t-1},u_{1:t})}{p(y_t|y_{1:t-1},u_{1:t})} \\\\\n",
    "                    &=\\frac{p(y_t|z_t,u_t)p(z_t|y_{1:t-1},u_{1:t})}{p(y_t|y_{1:t-1},u_{1:t})} \\\\\n",
    "\\end{align}\n",
    "### Prediction step\n",
    "The prediction step is straightforward to derive\n",
    "\\begin{align}\n",
    "p(z_t|y_{1:t-1},u_{1:t}) &=\\int p(z_t|z_{t-1})p(z_{t-1}|y_{1:t-1},u_{1:t}) dz_{t-1} \\\\\n",
    "                         &=\\int \\mathcal{N}(z_t|\\mathbf{A}_tz_{t-1}+\\mathbf{B}_tu_t,\\mathbf{Q}_t)\\mathcal{N}(z_{t-1}|\\mu_{t-1},\\Sigma_{t-1}) dz_{t-1} \\\\\n",
    "                         &=\\mathcal{N}(z_t|\\mu_{t|t-1},\\Sigma_{t|t-1}) \\\\\n",
    "             \\mu_{t|t-1} &=\\mathbf{A}_t\\mu_{t-1}+\\mathbf{B}_tu_t \\\\\n",
    "           \\Sigma_{t|t-1}&=\\mathbf{A}_t\\Sigma_{t-1}\\mathbf{A}_t^T+\\mathbf{Q}_t \\\\\n",
    "\\end{align}\n",
    "### Measurement step\n",
    "The measurement step can be computed using Bayes rule, as follows\n",
    "\\begin{align}\n",
    "p(z_t|y_{1:t},u_{1:t}) &=\\frac{p(y_t|z_t,u_t)p(z_t|y_{1:t-1},u_{1:t})}{p(y_t|y_{1:t-1},u_{1:t})} \\\\\n",
    "                       & \\propto p(y_t|z_t,u_t)p(z_t|y_{1:t-1},u_{1:t}) \\\\\n",
    "                       &= \\mathcal{N}(y_t|\\mathbf{C}_tz_t+\\mathbf{D}_tu_t,\\mathbf{R}_t)\\mathcal{N}(z_t|\\mu_{t|t-1},\\Sigma_{t|t-1}) \\\\\n",
    "                       &=\\mathcal{N}(z_t|\\mu_t,\\Sigma_t) \\\\\n",
    "               \\mu_t   &=\\mu_{t|t-1}+\\mathbf{K}_tr_t \\\\\n",
    "              \\Sigma_t &=(\\mathbf{I}-\\mathbf{K}_t\\mathbf{C}_t)\\Sigma_{t|t-1} \\\\\n",
    "\\end{align}\n",
    "where $r_t$ is the **residual** or **innovation**, given by the difference between our predicted observation and the actual observation\n",
    "\\begin{align}\n",
    "r_t       &=y_t-\\hat{y}_t \\\\\n",
    "\\hat{y}_t &=\\mathbb{E} \\left[y_t|y_{1:t-1},u_{1:t} \\right] =\\mathbf{C}_t\\mu_{t|t-1}+\\mathbf{D}_tu_t \\\\\n",
    "\\end{align}\n",
    "and $\\mathbf{K}_t$ is the **Kalman gain matrix**, given by\n",
    "$$\n",
    "\\mathbf{K}_t=\\Sigma_{t|t-1}\\mathbf{C}_t^T\\mathbf{S}_t^{-1}\n",
    "$$\n",
    "where\n",
    "\\begin{align}\n",
    "\\mathbf{S}_t  &=cov \\left[r_t|y_{1:t-1},u_{1:t})\\right] \\\\\n",
    "              &=\\mathbb{E}\\left[(\\mathbf{C}_tz_t+\\delta_t-\\hat{y}_t)\\mathbf{C}_tz_t+\\delta_t-\\hat{y}_t)^T |y_{1:t-1},u_{1:t})\\right]  \\\\\n",
    "              &=\\mathbf{C}_t\\Sigma_{t|t-1}\\mathbf{C}_t^T+\\mathbf{R}_t\n",
    "\\end{align}\n",
    "### Marginal likelihood\n",
    "As a byproduct of the algorithm, we can also compute the likelihood of the sequence using\n",
    "\\begin{align}\n",
    "p(y_{1:T}|u_{1:T}) &=p(y_{1:T-1},y_T|u_{1:T})  \\\\\n",
    "                   &=p(y_{1:T-1}|u_{1:T})p(y_T|y_{1:T-1},u_{1:T}) \\\\\n",
    "                   &=p(y_{1:T-1}|u_{1:T-1})p(y_T|y_{1:T-1},u_{1:T}) \\\\\n",
    "\\end{align}\n",
    "So we can use the recursive method, where $p(y_t|y_{1:t-1},u_{1:t})=\\mathcal{N}(y_t|\\mathbf{C}_t\\mu_{t|t-1}+\\mathbf{D}_tu_t,\\mathbf{S}_t)$\n",
    "### Posterior predictive\n",
    "The one-step-ahead posterior predictive density for the observations can be computed as follows\n",
    "\\begin{align}\n",
    "p(y_{T+1}|y_{1:T},u_{1:T+1}) &=\\int p(z_{T+1}|y_{1:T},u_{1:T+1})p(y_{T+1}|z_{T+1},u_{1:T+1})dz_{T+1} \\\\\n",
    "                             &=\\int \\mathcal{N}(z_{T+1}|\\mu_{T+1|T},\\Sigma_{T+1|T})\\mathcal{N}(y_{T+1}|\\mathbf{C}z_{T+1},R)dz_{T+1} \\\\\n",
    "                             &=\\mathcal{N}(y_{T+1}|\\mathbf{C}\\mu_{T+1|T},\\mathbf{C}\\Sigma_{T+1|T}\\mathbf{C}^T+\\mathbf{R})\n",
    "\\end{align}\n",
    "## 3.2 The Kalman smoothing algorithm\n",
    "In offline setting, we can wait until all the data has arrived, and then compute $p(z_t|y_{1:T})$. By conditioning on past and future data, our uncertainty will be significantly reduced (similar to the forward-backwards algorithm in HMM,but the algorithm is somewhat different, be careful).\n",
    "\\begin{align}\n",
    "p(z_t|y_{1:T})&=\\int p(z_t|y_{1:T},z_{t+1})p(z_{t+1}|y_{1:T})dz_{t+1} \\\\\n",
    "              &=\\int p(z_t|y_{1:t},z_{t+1})p(z_{t+1}|y_{1:T})dz_{t+1} \\\\\n",
    "\\end{align}\n",
    "By induction, assume we have already computed the smoothed distribution for $t+1$.\n",
    "$$\n",
    "p(z_{t+1}|y_{1:T})=\\mathcal{N}(z_{t+1}|\\mu_{t+1|T},\\Sigma_{t+1|T})\n",
    "$$\n",
    "The remaining task is to compute $p(z_t|y_{1:t},z_{t+1})$\n",
    "\\begin{align}\n",
    "p(z_t|y_{1:t},z_{t+1}) &=\\frac{p(z_t|y_{1:t})p(z_{t+1}|z_t,y_{1:t})}{p(z_{t+1}|y_{1:t})} \\\\\n",
    "                       &=\\frac{p(z_t|y_{1:t})p(z_{t+1}|z_t)}{p(z_{t+1}|y_{1:t})} \\\\\n",
    "\\end{align}\n",
    "$p(z_t|y_{1:t})$ is the result of Karman filter\n",
    "# 4. Learning for LG-SSM\n",
    "When using SSMs for time series forecasting, and also in some physical state estimation problems, the observation matrix $C$ and the transition matrix $A$ are both known and fixed, by definition of the model.In such cases, all that needs to be learned are the noise covariances $Q$ and $R$.Although we can estimate $Q$ and $R$ offline, using the methods describled below, it is also possible to derive a recursive procedure to exactly compute the posterior$p(z_t,Q,R|y_{1:t})$, which has the form of a Normal-inverse-Wishart distribution.\n",
    "## 4.1 Traning with fully observed data\n",
    "If we observe the hidden state sequences, we can fit the model by computing the MLEs (or even the full posteriors) for the parameters by solving a multivariate linear regression problem for $z_{t-1} \\rightarrow z_t$ and $z_t \\rightarrow y_t$\n",
    "## 4.2 EM for LG-SSM\n",
    "If we only observe the output sequence, we can compute ML or MAP estimates of the parameters using EM. The method is conceptually quite similar to the Baum-Welch algorithm for HMMs.except we use Kalman smoothing instead of forwards-backwards in the E step,and use different calculations in the M step."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
